Muino et al. Plant Methods 201 1, 7:1 1 
http://www.plantmethods.eom/content/7/1/1 1 



PLANT METHODS 



SOFTWARE Open Access 



ChlP-seq Analysis in R (CSAR): An R package for 
the statistical detection of protein-bound 
genomic regions 

Jose M Muino 1,2 *, Kerstin Kaufmann 3 , Roeland CHJ van Ham 1,2 , Gerco C Angenent 4 and Pawel Krajewski 5 
Abstract 

Background: In vivo detection of protein-bound genomic regions can be achieved by combining chromatin- 
immunoprecipitation with next-generation sequencing technology (ChlP-seq). The large amount of sequence data 
produced by this method needs to be analyzed in a statistically proper and computationally efficient manner. The 
generation of high copy numbers of DNA fragments as an artifact of the PCR step in ChlP-seq is an important 
source of bias of this methodology. 

Results: We present here an R package for the statistical analysis of ChlP-seq experiments. Taking the average size 
of DNA fragments subjected to sequencing into account, the software calculates single-nucleotide read-enrichment 
values. After normalization, sample and control are compared using a test based on the ratio test or the Poisson 
distribution. Test statistic thresholds to control the false discovery rate are obtained through random permutations. 
Computational efficiency is achieved by implementing the most time-consuming functions in C++ and integrating 
these in the R package. An analysis of simulated and experimental ChlP-seq data is presented to demonstrate the 
robustness of our method against PCR-artefacts and its adequate control of the error rate. 

Conclusions: The software ChlP-seq Analysis in R (CSAR) enables fast and accurate detection of protein-bound 
genomic regions through the analysis of ChlP-seq experiments. Compared to existing methods, we found that our 
package shows greater robustness against PCR-artefacts and better control of the error rate. 



Background 

Genome-wide identification of in vivo protein-bound 
genomic regions is essential for a full understanding of 
transcriptional regulation. DNA fragments that are 
bound by proteins in vivo can be isolated by chromatin- 
immunoprecipitation (ChIP) and subsequently identified 
using microarrays (ChlP-chip) or high-throughput 
sequencing technologies (ChlP-seq). Recent studies [1,2] 
indicate that the ChlP-seq approach provides higher 
resolution and statistical power than ChlP-chip. To date, 
only two methods have been described for the analysis of 
ChlP-seq experiments in plants, i.e. [3] and the method 
developed by our group [2,4]. 

The common approach to analyze the millions of short 
sequence reads obtained in a typical ChlP-seq 
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experiment is to map them to a reference genome using 
one of several mapping tools available, for example 
SOAPv2, Bowtie, or BWA [5-7]. Reads that map to mul- 
tiple locations in the genome, so called 'multireads' [8], 
are often discarded to avoid the ambiguity of their geno- 
mic origin. To account for varying sequencing depths 
among the different samples in an experiment, current 
methods typically standardize the number of mapped 
reads across all samples by a scaling factor. However, it is 
becoming evident that more sophisticated normalization 
procedures are needed, since differences in coverage dis- 
tribution among samples not only depend on the sequen- 
cing depth, but also on other properties of the sample 
[9], including methodological differences in library pre- 
paration, as well as biological differences in the chroma- 
tin state of the samples. We are aware of only two 
published ChlP-seq analysis methods that normalize the 
data to obtain the same coverage distribution across sam- 
ples. The PeakSeq method [10] applies a scaling factor 
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that is obtained from the linear regression between IP 
and control sample coverages, while in [11] a quantile 
normalization method is proposed. Here we describe the 
implementation of the approach introduced by our group 
[2], in which the statistical method of moments is used 
for the normalization process. 

Subsequent to normalization, enrichment of genomic 
regions is commonly evaluated with a test statistic based 
on the Poissson or Binomial distribution. To control the 
false discovery rate (FDR) of such a test, it is necessary 
to obtain the distribution of the test statistics under the 
null hypothesis. Some methods, e.g. CisGenome [12], 
assume this distribution as known a priori, given the 
statistical properties of the test. However, this assump- 
tion strongly depends on how well the distribution used 
to construct the test statistics (e.g. Poisson distribution) 
can represent the real data. Another strategy is to try to 
empirically estimate the distribution of the test statistic 
under the null hypothesis; the most common method is 
to assume that the score values obtained in the compar- 
ison of the IP sample against the control will be a good 
estimation of the desired distribution. Examples of soft- 
ware that implement this approach are PICS [13], 
MACS [14] or QuEST [15]. A second problem with the 
use of the Poisson test is that the comparison of differ- 
ent ChlP-seq experiments is not straightforward, since 
the obtained scores or /^-values will depend on the sta- 
tistical power of each particular experiment (e.g. number 
of replicates, number of reads obtained, etc.). A review 
of existing algorithms is given in [8]. 

ChIP experiments typically yield low amounts of DNA 
and therefore require a high number of PCR amplifica- 
tion cycles prior to sequencing. This increases the prob- 
ability of experimental artefacts, most importantly the 
uneven generation of high copy numbers of PCR frag- 
ments [16-18]. This effect in a given experiment can be 
estimated by measuring the percentage of non-unique 
sequence reads (hereafter referred to as "duplicate 



reads") obtained after sequencing. A high percentage of 
duplicate reads is an indication of potential problems 
due to PCR artefacts. Cell culture ChIP experiments 
yield larger amounts of DNA and can minimize the pro- 
blem, but this approach can only rarely be used in 
plants. A typical Illumina-sequenced plant IP library 
usually yields around 30%-40% duplicate reads (Table 1, 
[2,3,19]), while cell culture samples in other organisms 
typically yield a low fraction of duplicate reads (5%-10% 
[20,21]). A possible approach to handle this problem is 
to identify and discard duplicate reads. However, in 
plant experiments, this can lead to a 30%-40% data 
reduction in a standard ChlP-seq experiment [2] (Table 
1) and, consequently, to a decrease of the statistical 
power of the experiment. Also, it is expected that 
regions with a high read coverage will contain more 
duplicate reads than other regions of the same length, 
independently of PCR-artefacts. Therefore, the elimina- 
tion of duplicate reads may incorrectly change the score 
ranking of these regions. 

We present here an R package that implements the sta- 
tistical methodology previously outlined by our group 
[2,4]. The method was developed to efficiently handle 
high-copy numbers of reads that result from PCR artefacts 
without the need of eliminating duplicated sequences. The 
coverage distribution of samples is normalized to obtain 
the same mean and variance across samples. Users can 
choose between Poisson or ratio-based testing. FDR con- 
trol is achieved through the well-known method of per- 
mutations [22]. The most time-consuming functions are 
implemented in C++ and are fully integrated in the pack- 
age. A comparison with three other publically available 
methods is presented in the context of plant ChlP-seq 
analysis. 

Implementation 

The software accepts any plain text, tabular data format 
containing the following information for each mapped 



Table 1 Summary of read statistics for the ChlP-seq libraries analysed 



Library 
name* 


No. of sequenced 
reads 


No. of mapped 
reads 


No. of non-duplicated 
mapped reads 


Percentage of duplicated 
mapped reads 


SRA ID 


Sc 


4,065,558 


1,640,977 


1,047,009 


37% 


SRX004992 


s, 


3,112,455 


992,908 


525,779 


47% 


SRX004990 


S2-S5 


NA 


1,192,908 


525,779 


56% 


NA 


s 6 


614,236 


124,619 


56,619 


55% 


E-MTAB-587 


s 7 


1 ,474,956 


310,888 


79,996 


75% 


E-MTAB-587 


s 8 


4,105,326 


1,558,098 


78,434 


95% 


E-MTAB-587 


A c 


20,983,004 


11,703,244 


5,323,373 


54% 


SRX018394; 
SRX018395 


A, 


15,941,703 


1 3,293,909 


9,708,068 


27% 


SRX018392; 
SRX018393 



* For library description see text 
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read: chromosome, location (bp), strand (+/-), read 
length (bp), and number of times mapped on the gen- 
ome. Users can define specific input table formats in 
addition to the default option of the package, which 
expects the standard AlignedRead format supported by 
Bioconductor or the output of the mapping program 
SOAPv2. The average length of the DNA fragments sub- 
jected to sequencing must be provided by the user. 

In an ideal ChlP-seq experiment, sequence reads that 
truly originate from a protein-bound genomic region 
should map in a 1:1 ratio to both strands of the chromo- 
somal DNA (Figure IB). However, because some 
sequences are represented by an artificially high number 
of duplicate reads due to PCR artefacts, this ratio can be 
distorted (Figure 1C). In the default setup of our package 
(Figure 1A), uniquely mapped reads are virtually 
extended to match the average length of the DNA frag- 
ments subjected to sequencing. The number of extended 
reads that overlap each nucleotide position i is then 
counted for both strands independently, and the mini- 
mum value for both strands is taken, providing counts 
x is , where s = 1,2 for control and IP sample, respectively 
(Figure IB). Other setups allow the user to merge the 



information of both strands, or to just consider one of 
the strands in the analysis. 

Prior to the estimation of read-enrichment in an IP 
sample relative to a control sample, the data need to be 
normalized to obtain equal read-coverage distributions. 
The two main factors affecting read coverage are: 

1) Variable number of mapped reads among sequen- 
cing experiments. As commonly handled in the lit- 
erature, the CSAR package allows normalization of 
the data by reporting the number of hits per 6 mil- 
lions of reads, where 8 is an arbitrary number. 
Namely, the counts transformed to 



2) Variable number of regions sequenced. In the IP 
sample, reads will come preferentially from true 
positive and false positive protein-bound regions, 
while in the control sample, reads will come prefer- 
entially from false positive regions. This will result 
in different coverages in the IP and control samples 
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Figure 1 CSAR analysis workflow. (A) Typical analysis workflow using CSAR. (B) Mapped reads (continuous line) are virtually extended (dashed 
line) for each strand directionally. Number of extended reads that overlap each nucleotide position is counted for both strands independently, 
and the minimum value for both strands is taken as "number of hits". (C) Consequently, regions with duplicated reads mapping to only one 
strand will not be considered significant. (D) CSAR output can be visualized in a typical genome browser. 
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that should be taken into account in the analysis [9]. 
In contrast to other packages, CSAR will normalize 
the read-coverage distribution in the IP sample to 
have the same mean and variance as the control 
sample. Namely, the observed y i2 are transformed to 

0i (ya - A2) „ 
za = 7 + Ml, 

where jl s and a s denote the mean and standard devia- 
tion of y is . 

After normalization, a score (i,) is calculated for each 
nucleotide position i on the basis of the Poisson-based 
or ratio (default) test. 

For the Poisson-based test: 

/ ^ g-max( yil ,£) x max (l jyil;i 6) fe \ 

*— i°*(i-£ ) 

For the ratio test: 
max(l,y il , y S) 

The parameter fi represents the background coverage 
level of the IP sample after the value is scaled and nor- 
malized as any other value from the IP sample (see 
below). Usually, the coverage distribution of the control 
sample is not uniform with large regions showing no or 
very low coverage. These regions can be incorrectly 
declared significant since no good estimation of their 
coverage in the control can be obtained. To avoid this 
problem, the transformed counts in the regions with a 
coverage below fi in the control sample are set to the 
value of /3. The value of fi is calculated as: 

0-1 (c6 - /2 2 ) A 

P = ; ^ + Mi 

er 2 L x i2 

i 

where c is a parameter representing the coverage level 
of the IP sample before scaling and normalization. The 
value of c can be given by the user, or calculated auto- 
matically (default option) as: 




where n 0 denotes the number of genomic positions for 
which x i2 > 0. In our experience, the ratio test gives 
more comparable results among different experiments, 
which is due to the fact that its score value is less 
dependent on the statistical power of the experiments as 
for the Poisson test. 

Candidate peaks are defined as genomic regions with 
score values (tj) higher than a given cut-off. Candidate 



peaks separated by less than 100 bp (default parameter 
value) are merged. The maximum score value of the 
candidate peak is used as the test statistic value to test 
its significance. 

In contrast to other packages, CSAR subsequently uses 
a permutation method to obtain the test-statistic thresh- 
old corresponding to a desired FDR level. Individual 
mapped reads are labeled as "control" or "IP" if they 
belong to either the control or IP sample, respectively. 
The labels are then randomly permuted between the 
mapped reads, and the new permutated datasets are 
subjected to the previously described ChlP-seq analysis. 
Since this permutation process removes any relationship 
between the mapped reads and the sample they came 
from [22], the score values obtained over a sufficient 
number of permutations will provide an accurate esti- 
mation of the score distribution under the null hypoth- 
esis that can be used to control the error rate, for 
example FDR. 

CSAR can generate results regarding genomic posi- 
tions of significantly read-enriched regions and their dis- 
tance to annotated genomic features (e.g. genes, other 
annotated binding events) in tabular format. These can 
be directly used by other R functions or packages for 
further analysis or for graphical representation. The 
read-enriched genomic regions can be written to a 
UCSC web-browser compatible wiggle (wig) file and 
visualized (Figure ID) with, for example, the Integrated 
Genome Browser [23]. The default parameters in CSAR 
are optimized for Arabidopsis ChlP-seq data, but they 
can easily be adjusted for other organisms. 

Results and Discussion 

CSAR has been successfully used to analyze several 
plant ChlP-seq experiments and was shown to be com- 
putationally efficient and accurate [2,19]. Table 1 sum- 
marizes characteristics of Illumina sequence libraries 
that were reanalyzed in this study in order to compare 
the performance of CSAR (vl.4.0) with four other pub- 
licly available methods, i.e. QuEST (v2.4), PICS (vl.4.0), 
MACS (vl.4.0rc2) and Cisgenome (vl.2) [12-15]. 
SEPALLATA3 (SEP3) and APETALA1 (API) are two 
MADS-domain transcription factors involved in the reg- 
ulation of floral development in Arabidopsis thaliana. 
Datasets Si and S c represent an experimental IP and 
control libraries for a SEP3 ChlP-seq experiment [2]. S6, 
S 7 and S 8 represent sequencing libraries from the same 
IP experiment, except that low amounts of DNA were 
recovered from the ChIP step. Standard Illumina proto- 
col was used for the library preparation. S 6 , and S 7 were 
prepared according to the standard protocol and PCR 
amplified in 20 cycles. An additional second PCR ampli- 
fication step (+10 cyles) was performed to the library S 8 . 
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The amplification produced high numbers of duplicate 
reads (Table 1), with library S 8 most affected. We used 
these libraries to evaluate the robustness of our method 
against PCR artefacts. Datasets S 2 -S 5 represent in silico 
modifications of the Si library. At random, 2000 
uniquely mapped reads from Si were amplified one hun- 
dred times each and added to the original Si dataset. 
This process was repeated four times to generate the 
four dataset S 2 -S 5 . Datasets A l and A c represent the IP 
and control libraries, respectively, combining two biolo- 
gical API ChlP-seq replicates [19]. Libraries A 1 and A c 
were sequenced on the Genome Analyzer II, the others 
on the Genome Analyzer I; all libraries were sequenced 
to a 36 bp read length. Table 1 summarizes the number 
of mapped reads, as well as the percentage of duplicate 
reads present in each dataset. 

SOAPv2 (default parameters) was used to uniquely 
map reads to the Arabidopsis genome (ATH1.1- 
con.01222004; ftp://ftp.arabidopsis.org/). Reads mapping 
to the chloroplast or mitochondrial genomes were dis- 
carded. Remaining reads were analyzed with default 
parameters at an FDR level <0.05 by CSAR, QuEST, 
PICS, MACS and Cisgenome [12-15] using the appro- 
priate dataset as a control. 

Figure 2A shows the proportion of significant SEP3 
peaks declared by each method and for which a CArG 
box motif was present at a maximum distance of 50bp. 
Note that the CArG box is the known DNA binding 
motif of MADS-domain transcription factors and can 
thus be used as a validation criterion. CSAR shows a 
stronger enrichment than other methods. 

For API, publically available gene expression data 
could be used to validate peak calling. The expression 
data was generated in API induction experiments on 
the same tissue that was used in our API ChlP-seq 
experiment [19]. Figure 2B shows the percentage of sig- 
nificant API peaks declared by each method close to at 
least one potential direct target gene, where the target 
genes were as the ones which were differentially 
expressed in the time-series gene expression data [19]. 
CSAR shows a stronger enrichment than other methods. 

In order to study the robustness of each method 
against PCR artefacts, we considered the regions 
declared as significant in the comparison Si to S c for 
each evaluated method as its gold standard. A high per- 
centage of regions declared significant in the analysis of 
the in silico (S 2 -S 5 ) or experimentally (S 6 -S 8 ) modified 
Si libraries but not present in the gold standard for 
each method will indicate a lack of robustness. Table 2 
gives the number of significant regions in the different 
datasets as detected by each method. The number of 
significant regions in common in the comparison of Si 
to S c is shown, as well as the percentage of False Posi- 
tives. A "common region" is defined as a significant 



region (FDR < 0.05) located within 250 bp of a signifi- 
cant region (FDR < 0.05) in the comparison of Si to S c , 
using the same software; these common regions are 
considered as True Positives to allow for calculation of 
the percentage of False Positives. On average, 2,365 
regions were declared as significant in the comparison 
of Si to S c by the five methods. CSAR declares more 
regions significant than the other methods do. 

In the analysis of the in silico modified libraries S 2 -S 5 , 
MACS, CSAR and QuEST are the most robust methods 
with respect to the presence of high numbers of dupli- 
cate reads, as indicated by the low percentages of False 
Positives, an error rate below the 5% FDR control 
desired. A possible cause for the high percentage of 
False Positives obtained by Cisgenome in our in silico 
modified datasets might be in its FDR estimation step. 
Cisgenome assumes a Negative Binomial or a Poisson 
distribution for the score distribution under the null 
hypothesis. However, the presence of high numbers of 
duplicate reads will modify its original distribution and 
will have a strong effect on the FDR estimation. 

In the case of the experimental libraries which had high 
levels of duplicate reads (S 6 , S 7 and S 8 ), CSAR clearly 
shows a lower percentage of False Positives than all other 
packages, with an error rate close to the desired 5% FDR 
control. Because one might argue that this is done at the 
cost of having a relatively small number of significant 
regions declared in comparison to other packages, we 
repeated computations in CSAR with a more relaxed 
error control that gave 80% of false positives (a rate simi- 
lar to the one actually obtained for MACS). In this way, 
717 true positive (common) regions were found for S 6 
(out of 3,597 significant regions), 771 for S 7 (out of 
3,737), and 655 for S 8 (out of 3,307), which is comparable 
with the number of true positives obtained with MACS. 
It is interesting to note that although MACS shows 0% of 
False Positives in the in silico libraries, in the experimen- 
tal libraries, the error increases to an average of 79%. 
MACS (default options) eliminates reads that map to the 
exact same positions and strand above a maximum num- 
ber. For this reason, MACS eliminates the reads added in 
silico since these have the same sequence and therefore 
the same position and strand. In the experimental 
libraries, this strategy apparently did not work out. We 
hypothesize that due to degradation of the DNA frag- 
ments subjected to sequencing or due to sequencing 
errors, the short reads obtained from fragments with the 
same sequence will not always have the exact same posi- 
tions, preventing MACS from eliminate them. In the 
CSAR approach this is not a problem because it requires 
both strands to support the binding event independently. 

Since the percentage of duplicate reads can be easily 
calculated, we advise to always report it as a measure of 
quality in future ChlP-seq experiments. In this study we 
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Figure 2 ChlP-seq method comparison. (A) Proportion of peaks with a CArG-box (CCW 6 GG or CCW 7 G) within a distance of 50 bp among the 
significant regions detected by each method in the comparison of S, to S c . (B) Proportion of peaks detected by each method in the 
comparison of A, to A c with at least one target gene differentially expressed. Only peaks near a gene (3 kb upstream or 1kb downstream) 
represented in the microarray experiments were considered. The list of genes which expression is affected by API was downloaded from [19], 
we used the list denoted "Agilent and_or Operon_BH-0h". Default options for QuEST results in the identification of only 66 significant peaks, 
therefore we used the option "Relaxed peak calling parameters" for Figure 2B. For comparison purposes, all scores reported by the different 
methods were transformed into rank scores with zero as the rank of the most significant peak. 



used the libraries S 6 - S 8 as extreme examples of the 
effect of PCR artefacts, but we advise in general against 
working with high levels of duplication in a normal 
ChlP-seq experiment. Further study should establish 
more precisely which levels of duplication are still 
acceptable. This should be done in combination with 
evaluating other parameters such as the number of 



mapped reads. When working with proteins that bind 
preferentially to promoter regions, we found it useful to 
graphically represent for each experimental library the 
distribution of distances (bp) between the position of 
read-enriched regions and the start position of genes; in 
such graphs one should typically see enrichment in the 
expected positions (e.g.: promoter regions for SEP3 and 



Table 2 Number of significant regions detected 






S, vs S c 


S2-S5 vs S c * 


S 6 vs S c 


S 7 vs S c 


S 8 vs S c 


CSAR 


Total 


3,235 


3,306(5) 


57 


150 


126 




Common 


3,235 


3,226(2.6) 


52 


130 


104 




False Positives 




2% 


9% 


13% 


17% 


Quest 


Total 


985 


989(11) 


5,663 


4,724 


5,709 




Common 


985 


971(4.2) 


440 


433 


422 




False Positives 




2% 


92% 


91% 


92% 


CisGenome 


Total 


2,030 


14,632(30) 


9 


91 


169 




Common 


2,030 


1,633(4) 


1 


24 


23 




False Positives 




89% 


89% 


74% 


86% 


PICS 


Total 


2,846 


1,952(24.7) 


1,256 


1,575 


153 




Common 


2,846 


1,253(5.9) 


382 


435 


51 




False Positives 




36% 


70% 


72% 


67% 


MACS 


Total 


2,728 


2,728(0) 


2,821 


3,687 


3,624 




Common 


2,728 


2728(0) 


631 


761 


716 



False Positives 0% 78% 79% 80% 



*Results for the in s/7/co-modified libraries (S 2 -S 5 ) are summarized with its average and standard deviation (in parenthesis) 
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API TFs). If this is not the case, this might be an indi- 
cation of a problem in the experimental IP enrichment. 
CSAR provides functions to easily calculate and visualize 
this distribution and to report the number of duplicate 
reads. 

In conclusion, the CSAR package, implemented in the 
popular R language, provides an accurate and efficient 
tool for the analysis of plant ChlP-seq data. It shows 
better accuracy compared to other methods in the two 
plant ChlP-seq experiments considered, and, in particu- 
lar, it shows a high level of robustness against PCR-arte- 
facts. A good error rate control is one of the most 
important features of any statistical process, and CSAR 
shows a good control even with a high percentage of 
duplicate reads. 

Availability and requirements 

♦ Project name: CSAR 

♦ Project home page: http://bioconductor.org/ 
packages/release/bioc/html/CSAR.html 

♦ Operating system(s): Platform independent 

♦ Programming language: R 

♦ Other requirements: R version 2.8.1 or superior 

♦ License: Artistic-2.0 

♦ Any restrictions to use by non-academics: None 

♦ The software (source code) and examples are 
attached in Additional file 1. It can also be down- 
loaded via the project home page. 

Additional material 



Additional file 1: CSAR R package source The R package source for 
CSAR (version 1.4.0) is included as additional file. 
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